Motor vehicle collisions reported by the New York City Police Department from January-August 2020. Each record represents an individual collision, including the date, time and location of the accident (borough, zip code, street name, latitude/longitude), vehicles and victims involved, and contributing factors.
Data Source: https://www.mavenanalytics.io/data-playground
import pandas as pd
import numpy as np
df = pd.read_csv("NYCAccidents2020.csv")
df.head(5)
|   | CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | CROSS STREET NAME | OFF STREET NAME | ... | CONTRIBUTING FACTOR VEHICLE 2 | CONTRIBUTING FACTOR VEHICLE 3 | CONTRIBUTING FACTOR VEHICLE 4 | CONTRIBUTING FACTOR VEHICLE 5 | COLLISION_ID | VEHICLE TYPE CODE 1 | VEHICLE TYPE CODE 2 | VEHICLE TYPE CODE 3 | VEHICLE TYPE CODE 4 | VEHICLE TYPE CODE 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-08-29 | 15:40:00 | BRONX | 10466.0 | 40.89210 | -73.833760 | POINT (-73.83376 40.8921) | PRATT AVENUE | STRANG AVENUE | NaN | ... | Unspecified | NaN | NaN | NaN | 4342908 | Sedan | Station Wagon/Sport Utility Vehicle | NaN | NaN | NaN |
| 1 | 2020-08-29 | 21:00:00 | BROOKLYN | 11221.0 | 40.69050 | -73.919914 | POINT (-73.919914 40.6905) | BUSHWICK AVENUE | PALMETTO STREET | NaN | ... | Unspecified | NaN | NaN | NaN | 4343555 | Sedan | Sedan | NaN | NaN | NaN |
| 2 | 2020-08-29 | 18:20:00 | NaN | NaN | 40.81650 | -73.946556 | POINT (-73.946556 40.8165) | 8 AVENUE | NaN | NaN | ... | NaN | NaN | NaN | NaN | 4343142 | Station Wagon/Sport Utility Vehicle | NaN | NaN | NaN | NaN |
| 3 | 2020-08-29 | 00:00:00 | BRONX | 10459.0 | 40.82472 | -73.892960 | POINT (-73.89296 40.82472) | NaN | NaN | 1047 SIMPSON STREET | ... | Unspecified | Unspecified | Unspecified | NaN | 4343588 | Station Wagon/Sport Utility Vehicle | Station Wagon/Sport Utility Vehicle | Sedan | Motorcycle | NaN |
| 4 | 2020-08-29 | 17:10:00 | BROOKLYN | 11203.0 | 40.64989 | -73.933890 | POINT (-73.93389 40.64989) | NaN | NaN | 4609 SNYDER AVENUE | ... | Unspecified | NaN | NaN | NaN | 4342953 | Sedan | Sedan | NaN | NaN | NaN |
5 rows × 29 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74881 entries, 0 to 74880
Data columns (total 29 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   CRASH DATE                     74881 non-null  object
 1   CRASH TIME                     74881 non-null  object
 2   BOROUGH                        49140 non-null  object
 3   ZIP CODE                       49134 non-null  float64
 4   LATITUDE                       68935 non-null  float64
 5   LONGITUDE                      68935 non-null  float64
 6   LOCATION                       68935 non-null  object
 7   ON STREET NAME                 55444 non-null  object
 8   CROSS STREET NAME              35681 non-null  object
 9   OFF STREET NAME                19437 non-null  object
 10  NUMBER OF PERSONS INJURED      74881 non-null  int64
 11  NUMBER OF PERSONS KILLED       74881 non-null  int64
 12  NUMBER OF PEDESTRIANS INJURED  74881 non-null  int64
 13  NUMBER OF PEDESTRIANS KILLED   74881 non-null  int64
 14  NUMBER OF CYCLIST INJURED      74881 non-null  int64
 15  NUMBER OF CYCLIST KILLED       74881 non-null  int64
 16  NUMBER OF MOTORIST INJURED     74881 non-null  int64
 17  NUMBER OF MOTORIST KILLED      74881 non-null  int64
 18  CONTRIBUTING FACTOR VEHICLE 1  74577 non-null  object
 19  CONTRIBUTING FACTOR VEHICLE 2  59285 non-null  object
 20  CONTRIBUTING FACTOR VEHICLE 3  6765 non-null   object
 21  CONTRIBUTING FACTOR VEHICLE 4  1851 non-null   object
 22  CONTRIBUTING FACTOR VEHICLE 5  523 non-null    object
 23  COLLISION_ID                   74881 non-null  int64
 24  VEHICLE TYPE CODE 1            74246 non-null  object
 25  VEHICLE TYPE CODE 2            53638 non-null  object
 26  VEHICLE TYPE CODE 3            6424 non-null   object
 27  VEHICLE TYPE CODE 4            1771 non-null   object
 28  VEHICLE TYPE CODE 5            503 non-null    object
dtypes: float64(3), int64(9), object(17)
memory usage: 16.6+ MB
We need to make sure the data is clean before starting the analysis. As a reminder, we should check for duplicate records, incorrect data types, and missing values.
How many duplicate collision records are there?
dup_rows = df.duplicated().sum()
dup_rows
0
Drop the duplicated records.
df = df.drop_duplicates()
df.head()
|   | CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | CROSS STREET NAME | OFF STREET NAME | ... | CONTRIBUTING FACTOR VEHICLE 2 | CONTRIBUTING FACTOR VEHICLE 3 | CONTRIBUTING FACTOR VEHICLE 4 | CONTRIBUTING FACTOR VEHICLE 5 | COLLISION_ID | VEHICLE TYPE CODE 1 | VEHICLE TYPE CODE 2 | VEHICLE TYPE CODE 3 | VEHICLE TYPE CODE 4 | VEHICLE TYPE CODE 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-08-29 | 15:40:00 | BRONX | 10466.0 | 40.89210 | -73.833760 | POINT (-73.83376 40.8921) | PRATT AVENUE | STRANG AVENUE | NaN | ... | Unspecified | NaN | NaN | NaN | 4342908 | Sedan | Station Wagon/Sport Utility Vehicle | NaN | NaN | NaN |
| 1 | 2020-08-29 | 21:00:00 | BROOKLYN | 11221.0 | 40.69050 | -73.919914 | POINT (-73.919914 40.6905) | BUSHWICK AVENUE | PALMETTO STREET | NaN | ... | Unspecified | NaN | NaN | NaN | 4343555 | Sedan | Sedan | NaN | NaN | NaN |
| 2 | 2020-08-29 | 18:20:00 | NaN | NaN | 40.81650 | -73.946556 | POINT (-73.946556 40.8165) | 8 AVENUE | NaN | NaN | ... | NaN | NaN | NaN | NaN | 4343142 | Station Wagon/Sport Utility Vehicle | NaN | NaN | NaN | NaN |
| 3 | 2020-08-29 | 00:00:00 | BRONX | 10459.0 | 40.82472 | -73.892960 | POINT (-73.89296 40.82472) | NaN | NaN | 1047 SIMPSON STREET | ... | Unspecified | Unspecified | Unspecified | NaN | 4343588 | Station Wagon/Sport Utility Vehicle | Station Wagon/Sport Utility Vehicle | Sedan | Motorcycle | NaN |
| 4 | 2020-08-29 | 17:10:00 | BROOKLYN | 11203.0 | 40.64989 | -73.933890 | POINT (-73.93389 40.64989) | NaN | NaN | 4609 SNYDER AVENUE | ... | Unspecified | NaN | NaN | NaN | 4342953 | Sedan | Sedan | NaN | NaN | NaN |
5 rows × 29 columns
# Convert the date and time columns from object to datetime
df['CRASH DATE'] = pd.to_datetime(df['CRASH DATE'])
df['CRASH TIME'] = pd.to_datetime(df['CRASH TIME'],format='%H:%M:%S').dt.time
How many missing values are there?
df.isnull().sum() # Count the missing values in each column
CRASH DATE                           0
CRASH TIME                           0
BOROUGH                          25741
ZIP CODE                         25747
LATITUDE                          5946
LONGITUDE                         5946
LOCATION                          5946
ON STREET NAME                   19437
CROSS STREET NAME                39200
OFF STREET NAME                  55444
NUMBER OF PERSONS INJURED            0
NUMBER OF PERSONS KILLED             0
NUMBER OF PEDESTRIANS INJURED        0
NUMBER OF PEDESTRIANS KILLED         0
NUMBER OF CYCLIST INJURED            0
NUMBER OF CYCLIST KILLED             0
NUMBER OF MOTORIST INJURED           0
NUMBER OF MOTORIST KILLED            0
CONTRIBUTING FACTOR VEHICLE 1      304
CONTRIBUTING FACTOR VEHICLE 2    15596
CONTRIBUTING FACTOR VEHICLE 3    68116
CONTRIBUTING FACTOR VEHICLE 4    73030
CONTRIBUTING FACTOR VEHICLE 5    74358
COLLISION_ID                         0
VEHICLE TYPE CODE 1                635
VEHICLE TYPE CODE 2              21243
VEHICLE TYPE CODE 3              68457
VEHICLE TYPE CODE 4              73110
VEHICLE TYPE CODE 5              74378
dtype: int64
# Drop rows with missing 'LOCATION' or 'CONTRIBUTING FACTOR VEHICLE 1'
df = df[df['LOCATION'].notna()]
df = df[df['CONTRIBUTING FACTOR VEHICLE 1'].notna()]
# Standardize all column names to snake_case
df.columns = df.columns.str.lower().str.replace(' ', '_')
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 68647 entries, 0 to 74880
Data columns (total 29 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   crash_date                     68647 non-null  datetime64[ns]
 1   crash_time                     68647 non-null  object
 2   borough                        47517 non-null  object
 3   zip_code                       47511 non-null  float64
 4   latitude                       68647 non-null  float64
 5   longitude                      68647 non-null  float64
 6   location                       68647 non-null  object
 7   on_street_name                 49872 non-null  object
 8   cross_street_name              32734 non-null  object
 9   off_street_name                18775 non-null  object
 10  number_of_persons_injured      68647 non-null  int64
 11  number_of_persons_killed       68647 non-null  int64
 12  number_of_pedestrians_injured  68647 non-null  int64
 13  number_of_pedestrians_killed   68647 non-null  int64
 14  number_of_cyclist_injured      68647 non-null  int64
 15  number_of_cyclist_killed       68647 non-null  int64
 16  number_of_motorist_injured     68647 non-null  int64
 17  number_of_motorist_killed      68647 non-null  int64
 18  contributing_factor_vehicle_1  68647 non-null  object
 19  contributing_factor_vehicle_2  54520 non-null  object
 20  contributing_factor_vehicle_3  6200 non-null   object
 21  contributing_factor_vehicle_4  1704 non-null   object
 22  contributing_factor_vehicle_5  489 non-null    object
 23  collision_id                   68647 non-null  int64
 24  vehicle_type_code_1            68274 non-null  object
 25  vehicle_type_code_2            49155 non-null  object
 26  vehicle_type_code_3            5885 non-null   object
 27  vehicle_type_code_4            1629 non-null   object
 28  vehicle_type_code_5            471 non-null    object
dtypes: datetime64[ns](1), float64(3), int64(9), object(16)
memory usage: 15.7+ MB
Data cleaning finished! 👏 Now we are ready to answer questions and draw conclusions from our data. 👌 🍀
How many accidents were registered by the police in New York City in 2020 by date?
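The line plot below expects a frame `dfg` with one row per date and a `count` column; a sketch of how it could be built (stand-in data here, the cleaned `df` in the notebook):

```python
import pandas as pd

# Stand-in rows; in the notebook this is the cleaned df
df_demo = pd.DataFrame({
    "crash_date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02"]),
    "collision_id": [1, 2, 3],
})

# One row per date with the number of collisions reported that day
dfg = df_demo.groupby("crash_date").size().reset_index(name="count")
```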
import plotly.express as px

# dfg: one row per date with a 'count' of collisions (built beforehand)
fig = px.line(dfg, x="crash_date", y="count"
,hover_data={"crash_date": "|%B %d, %Y"}
,markers=True
,color_discrete_sequence=px.colors.diverging.PRGn
,template="plotly_white"
)
fig.update_layout(
title="Number of accidents by date (Jan-Aug 2020)"
,xaxis_title="Date"
,yaxis_title="Number of accidents"
)
fig.update_xaxes(
dtick="M1",
tickformat="%b\n%Y",
ticklabelmode="period")
fig.show()
# df_stack: percentage share of each month's accidents per year (built beforehand)
fig3 = px.bar(df_stack
,x='Percentage'
,y='year'
,color='month'
,orientation='h'
)
fig3.update_layout(title="Percentage share of accidents by month",
template='simple_white', xaxis_title='%',
yaxis_title='Year',
legend_title_text='Month')
fig3
As we can observe, the number of accidents drops sharply from the end of February. One likely reason is that fewer people were commuting to work in these months due to COVID.
As we did with months, we can analyze the distribution of car accidents by hour of day and day of week, again using bar plots.
# dfg holds one row per collision, with 'hour' and 'month_' derived from crash_time and crash_date
dfg = dfg.groupby(['hour','month_']).count().reset_index()
dfg.rename(columns = {"location": "count"}, inplace=True)
fig = px.bar(dfg, x="hour", y="count", color='month_')
fig.update_layout(
title="Number of accidents by hour (Jan-Aug 2020)"
,xaxis_title="Hour of day"
,yaxis_title="Number of accidents"
,legend_title_text='Month'
)
fig.show()
As we can observe in the plot, the greatest number of accidents occurs in the afternoon, between 14:00 and 18:00. Accidents also tend to be more severe in the evening.
# dfg regrouped by weekday (derived from crash_date) and month_
fig = px.bar(dfg, x="weekday", y="count", color='month_')
fig.update_layout(
title="Number of accidents by weekday (Jan-Aug 2020)"
,xaxis_title="Weekday"
,yaxis_title="Number of accidents"
)
fig.show()
# dfg here holds the average daily count per weekday
fig = px.bar(dfg, x="weekday", y="count")
fig.update_layout(
title="Average number of accidents by weekday (Jan-Aug 2020)"
,xaxis_title="Weekday"
,yaxis_title="Number of accidents")
fig.show()
As shown in the plot above, the number of car accidents decreases at the weekend. Weekdays average around 280-300 car accidents per day, roughly 40 more per day than weekends.
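The weekday averages quoted above can be reproduced by counting collisions per calendar day and then averaging those daily counts by weekday; a sketch with stand-in dates:

```python
import pandas as pd

# Stand-in dates; in the notebook, use df["crash_date"]
dates = pd.to_datetime(
    ["2020-01-06", "2020-01-06", "2020-01-13", "2020-01-11"]  # two Mondays, one Saturday
)

# Collisions per calendar day
daily = pd.Series(1, index=dates).groupby(level=0).sum()

# Average daily collisions for each weekday name
avg_by_weekday = daily.groupby(daily.index.day_name()).mean()
```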
The data we are analyzing contains information related to (1) victims, (2) contributing factors, and (3) vehicles. Regarding the type of accident, the data frame includes the number of people injured and killed, the contributing factor of the accident, and the vehicle types involved.
The data frame records how many victims were injured and killed in each car accident. We can easily represent the percentages of injuries and fatalities using pie plots as follows:
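The mask-based binning in the next cell can also be written more compactly with `pd.cut`; a hypothetical equivalent:

```python
import pandas as pd

# Example injury counts; in the notebook, use df["number_of_persons_injured"]
injured = pd.Series([0, 1, 2, 4, 7])

# Bin the counts into the categories used by the pie chart
labels = pd.cut(
    injured,
    bins=[-1, 0, 1, 2, 5, float("inf")],
    labels=["0", "1", "2", "3-5", "More than 5"],
)
```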
import plotly.express as px

# Bin the number of injured people into readable categories
df['number_of_injured'] = df['number_of_persons_injured'].astype(str)
df.loc[df['number_of_persons_injured'] > 2, 'number_of_injured'] = '3-5'
df.loc[df['number_of_persons_injured'] > 5, 'number_of_injured'] = 'More than 5'
# With no 'values' argument, px.pie counts the rows in each category
fig = px.pie(df, names='number_of_injured', title='Injuries in 2020', color_discrete_sequence=px.colors.diverging.PRGn)
fig.show()
The plot shows that 73% of accidents had no one injured, while 21% had one person injured.
import plotly.express as px

# Bin the number of people killed into readable categories
df['number_of_killed'] = df['number_of_persons_killed'].astype(str)
df.loc[df['number_of_persons_killed'] > 2, 'number_of_killed'] = '3-5'
df.loc[df['number_of_persons_killed'] > 5, 'number_of_killed'] = 'More than 5'
fig = px.pie(df, names='number_of_killed', title='Killed in 2020', color_discrete_sequence=px.colors.diverging.PRGn)
fig.show()
Fewer than 1% of accidents involved a fatality.
taba = pd.crosstab(df.on_street_name, df.month_, values=df.location, aggfunc='count').round(0)
taba = taba.sort_values(by='January',ascending=False)
tabb = pd.crosstab(df.on_street_name, df.month_, values=df.location, aggfunc='count',normalize='columns').round(4)*100
tab = (
pd.concat([taba, tabb],axis = 1, keys = ['count', '%'])
.swaplevel(axis = 1)
.sort_index(axis = 1, ascending=[True, False])
.rename_axis(['month_', 'on_street_name'], axis = 1)
)
tab.head(10)
| on_street_name | Jan count | Jan % | Feb count | Feb % | Mar count | Mar % | Apr count | Apr % | May count | May % | Jun count | Jun % | Jul count | Jul % | Aug count | Aug % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BELT PARKWAY | 166.0 | 1.71 | 209.0 | 2.25 | 170.0 | 2.29 | 68.0 | 2.48 | 103.0 | 2.54 | 120.0 | 2.39 | 145.0 | 2.44 | 137.0 | 2.42 |
| BROOKLYN QUEENS EXPRESSWAY | 141.0 | 1.45 | 129.0 | 1.39 | 115.0 | 1.55 | 40.0 | 1.46 | 51.0 | 1.26 | 65.0 | 1.29 | 65.0 | 1.09 | 88.0 | 1.55 |
| GRAND CENTRAL PKWY | 128.0 | 1.32 | 114.0 | 1.23 | 87.0 | 1.17 | 29.0 | 1.06 | 38.0 | 0.94 | 49.0 | 0.97 | 71.0 | 1.19 | 48.0 | 0.85 |
| FDR DRIVE | 125.0 | 1.29 | 108.0 | 1.16 | 129.0 | 1.74 | 77.0 | 2.81 | 65.0 | 1.60 | 48.0 | 0.95 | 62.0 | 1.04 | 72.0 | 1.27 |
| LONG ISLAND EXPRESSWAY | 125.0 | 1.29 | 137.0 | 1.47 | 106.0 | 1.43 | 42.0 | 1.53 | 46.0 | 1.13 | 76.0 | 1.51 | 93.0 | 1.56 | 78.0 | 1.38 |
| MAJOR DEEGAN EXPRESSWAY | 111.0 | 1.14 | 111.0 | 1.19 | 79.0 | 1.07 | 38.0 | 1.39 | 47.0 | 1.16 | 51.0 | 1.01 | 44.0 | 0.74 | 73.0 | 1.29 |
| BROADWAY | 101.0 | 1.04 | 103.0 | 1.11 | 80.0 | 1.08 | 27.0 | 0.99 | 32.0 | 0.79 | 70.0 | 1.39 | 73.0 | 1.23 | 73.0 | 1.29 |
| ATLANTIC AVENUE | 95.0 | 0.98 | 101.0 | 1.09 | 84.0 | 1.13 | 31.0 | 1.13 | 31.0 | 0.76 | 55.0 | 1.09 | 61.0 | 1.02 | 54.0 | 0.95 |
| CROSS BRONX EXPY | 89.0 | 0.92 | 83.0 | 0.89 | 85.0 | 1.15 | 24.0 | 0.88 | 54.0 | 1.33 | 48.0 | 0.95 | 51.0 | 0.86 | 72.0 | 1.27 |
| 3 AVENUE | 86.0 | 0.89 | 81.0 | 0.87 | 80.0 | 1.08 | 16.0 | 0.58 | 27.0 | 0.66 | 48.0 | 0.95 | 43.0 | 0.72 | 37.0 | 0.65 |
taba = pd.crosstab(df.contributing_factor_vehicle_1, df.month_, values=df.location, aggfunc='count').round(0)
taba = taba.sort_values(by='January',ascending=False)
tabb = pd.crosstab(df.contributing_factor_vehicle_1, df.month_, values=df.location, aggfunc='count',normalize='columns').round(4)*100
tab = (
pd.concat([taba, tabb],axis = 1, keys = ['count', '%'])
.swaplevel(axis = 1)
.sort_index(axis = 1, ascending=[True, False])
.rename_axis(['month_', 'contributing_factor_vehicle_1'], axis = 1)
)
tab.head(10)
| contributing_factor_vehicle_1 | Jan count | Jan % | Feb count | Feb % | Mar count | Mar % | Apr count | Apr % | May count | May % | Jun count | Jun % | Jul count | Jul % | Aug count | Aug % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Driver Inattention/Distraction | 3398.0 | 25.93 | 3387.0 | 27.10 | 2803.0 | 27.54 | 919.0 | 24.36 | 1367.0 | 24.19 | 1738.0 | 24.91 | 2152.0 | 25.48 | 2042.0 | 25.46 |
| Unspecified | 3335.0 | 25.45 | 3074.0 | 24.60 | 2523.0 | 24.79 | 1021.0 | 27.06 | 1530.0 | 27.07 | 1929.0 | 27.65 | 2414.0 | 28.58 | 2323.0 | 28.97 |
| Failure to Yield Right-of-Way | 1002.0 | 7.65 | 944.0 | 7.55 | 663.0 | 6.51 | 203.0 | 5.38 | 314.0 | 5.56 | 420.0 | 6.02 | 509.0 | 6.03 | 477.0 | 5.95 |
| Following Too Closely | 961.0 | 7.33 | 938.0 | 7.51 | 754.0 | 7.41 | 150.0 | 3.98 | 310.0 | 5.49 | 384.0 | 5.50 | 463.0 | 5.48 | 420.0 | 5.24 |
| Backing Unsafely | 588.0 | 4.49 | 559.0 | 4.47 | 456.0 | 4.48 | 153.0 | 4.06 | 200.0 | 3.54 | 234.0 | 3.35 | 265.0 | 3.14 | 272.0 | 3.39 |
| Passing or Lane Usage Improper | 530.0 | 4.04 | 504.0 | 4.03 | 379.0 | 3.72 | 124.0 | 3.29 | 188.0 | 3.33 | 265.0 | 3.80 | 321.0 | 3.80 | 288.0 | 3.59 |
| Passing Too Closely | 528.0 | 4.03 | 497.0 | 3.98 | 390.0 | 3.83 | 113.0 | 2.99 | 194.0 | 3.43 | 235.0 | 3.37 | 294.0 | 3.48 | 307.0 | 3.83 |
| Other Vehicular | 378.0 | 2.88 | 364.0 | 2.91 | 242.0 | 2.38 | 109.0 | 2.89 | 182.0 | 3.22 | 230.0 | 3.30 | 275.0 | 3.26 | 250.0 | 3.12 |
| Unsafe Lane Changing | 338.0 | 2.58 | 348.0 | 2.78 | 260.0 | 2.55 | 57.0 | 1.51 | 107.0 | 1.89 | 126.0 | 1.81 | 151.0 | 1.79 | 149.0 | 1.86 |
| Turning Improperly | 247.0 | 1.88 | 251.0 | 2.01 | 219.0 | 2.15 | 59.0 | 1.56 | 100.0 | 1.77 | 117.0 | 1.68 | 162.0 | 1.92 | 148.0 | 1.85 |
taba = pd.crosstab(df.vehicle_type_code_1, df.month_, values=df.location, aggfunc='count').round(0)
taba = taba.sort_values(by='January',ascending=False)
tabb = pd.crosstab(df.vehicle_type_code_1, df.month_, values=df.location, aggfunc='count',normalize='columns').round(4)*100
tab = (
pd.concat([taba, tabb],axis = 1, keys = ['count', '%'])
.swaplevel(axis = 1)
.sort_index(axis = 1, ascending=[True, False])
.rename_axis(['month_', 'vehicle_type_code_1'], axis = 1)
)
tab.head(10)
| vehicle_type_code_1 | Jan count | Jan % | Feb count | Feb % | Mar count | Mar % | Apr count | Apr % | May count | May % | Jun count | Jun % | Jul count | Jul % | Aug count | Aug % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sedan | 5678.0 | 43.55 | 5482.0 | 44.08 | 4537.0 | 44.83 | 1887.0 | 50.37 | 2725.0 | 48.50 | 3295.0 | 47.47 | 4028.0 | 47.97 | 3876.0 | 48.60 |
| Station Wagon/Sport Utility Vehicle | 5265.0 | 40.38 | 4794.0 | 38.55 | 3877.0 | 38.31 | 1324.0 | 35.34 | 2022.0 | 35.99 | 2459.0 | 35.43 | 2858.0 | 34.04 | 2817.0 | 35.32 |
| Taxi | 596.0 | 4.57 | 672.0 | 5.40 | 455.0 | 4.50 | 86.0 | 2.30 | 133.0 | 2.37 | 172.0 | 2.48 | 239.0 | 2.85 | 224.0 | 2.81 |
| Pick-up Truck | 375.0 | 2.88 | 348.0 | 2.80 | 299.0 | 2.95 | 94.0 | 2.51 | 102.0 | 1.82 | 137.0 | 1.97 | 184.0 | 2.19 | 181.0 | 2.27 |
| Box Truck | 249.0 | 1.91 | 250.0 | 2.01 | 199.0 | 1.97 | 63.0 | 1.68 | 118.0 | 2.10 | 132.0 | 1.90 | 153.0 | 1.82 | 135.0 | 1.69 |
| Bus | 249.0 | 1.91 | 225.0 | 1.81 | 152.0 | 1.50 | 25.0 | 0.67 | 48.0 | 0.85 | 49.0 | 0.71 | 74.0 | 0.88 | 73.0 | 0.92 |
| Tractor Truck Diesel | 87.0 | 0.67 | 104.0 | 0.84 | 71.0 | 0.70 | 27.0 | 0.72 | 46.0 | 0.82 | 56.0 | 0.81 | 59.0 | 0.70 | 61.0 | 0.76 |
| Van | 58.0 | 0.44 | 70.0 | 0.56 | 55.0 | 0.54 | 20.0 | 0.53 | 35.0 | 0.62 | 54.0 | 0.78 | 59.0 | 0.70 | 39.0 | 0.49 |
| Bike | 57.0 | 0.44 | 57.0 | 0.46 | 75.0 | 0.74 | 37.0 | 0.99 | 94.0 | 1.67 | 145.0 | 2.09 | 163.0 | 1.94 | 156.0 | 1.96 |
| Ambulance | 43.0 | 0.33 | 49.0 | 0.39 | 41.0 | 0.41 | 24.0 | 0.64 | 36.0 | 0.64 | 27.0 | 0.39 | 53.0 | 0.63 | 34.0 | 0.43 |
(1) Most accidents involved a Sedan, Station Wagon/Sport Utility Vehicle, or Taxi; nearly half involved a Sedan.
(2) Most accidents in 2020 had no injured victims (73%).
(3) Accidents tend to be more severe at night, in the late evening, and on weekends.
(4) Driver Inattention/Distraction is the leading contributing factor (around 25% of accidents).
The best way to analyze spatial data is with maps. Folium is a Python library for creating several types of Leaflet maps. We can easily generate a map of New York City by creating a folium Map object. The location argument centers the map on a specific point (in our case, New York City), and an initial zoom level zooms the map into that center.
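The `location` strings have the WKT form `POINT (lon lat)`; the regex extraction used in the next cell can be sketched in isolation (the helper name is just for illustration):

```python
import re

def parse_point(location: str):
    """Return [lat, lon] from a 'POINT (lon lat)' string, or None."""
    match = re.search(r"\((.*?)\)", location)
    if match is None:
        return None
    parts = match.group(1).split(" ")
    if len(parts) != 2:
        return None
    lon, lat = (round(float(p), 4) for p in parts)
    return [lat, lon]
```

Note the flip: WKT stores longitude first, while folium expects `[lat, lon]` pairs.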
import re
import folium

df['store_coords'] = None

# Parse the WKT "POINT (lon lat)" strings in 'location' into [lat, lon] pairs
for i, row in df.iterrows():
    match = re.search(r'\((.*?)\)', row['location'])
    if match is None:
        continue
    coords = [round(float(c), 4) for c in match.group(1).split(' ')]
    coords = coords[::-1]  # WKT stores (lon, lat); folium wants [lat, lon]
    if len(coords) == 2:
        df.at[i, 'store_coords'] = coords
        df.at[i, 'lat'] = coords[0]
        df.at[i, 'lon'] = coords[1]

store_coords_lst = [c for c in df['store_coords'].tolist() if c]
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
import folium
from folium import plugins
crash_map = folium.Map(
    location=[df['latitude'].mean(), df['longitude'].mean()],
    zoom_start = 15,
    prefer_canvas = True,
    disable_3d = True
)
crash_map.add_child(
    plugins.HeatMap(
        data = store_coords_lst,
        radius = 15
    )
)
from folium.plugins import HeatMapWithTime

df_map = folium.Map(location=[df['latitude'].mean(), df['longitude'].mean()], zoom_start=12)

# One list of [lat, lon] points per month; the numeric 'month' column is derived from crash_date
month_list = [[] for _ in range(8)]
for lat, lon, month in zip(df.latitude, df.longitude, df.month):
    month_list[month - 1].append([lat, lon])

index = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug']
HeatMapWithTime(month_list, index, auto_play=True).add_to(df_map)
df_map
!jupyter nbconvert --to html NYC_accidents_v3.ipynb
[NbConvertApp] Converting notebook NYC_accidents_v3.ipynb to html [NbConvertApp] Writing 9686701 bytes to NYC_accidents_v3.html
Looking at the timeline above, we can observe how the number of accidents per month changes across the city, with a clear drop in April at the height of the lockdown.
Thanks for reading!!! 😊 🍀